NSF PAR Search | NSF Public Access Repository


"One-off Events? An Empirical Study of Hackathon Code Creation and Reuse"

Mahmoud, A; Dey, T; Nolte, A; Mockus, A; Herbsleb, J (January 2022, Empirical software engineering)

Deriving a usage-independent software quality metric

https://doi.org/10.1007/s10664-019-09791-w

Dey, T; Mockus, A (May 2020, Empirical software engineering)

Context: The extent of post-release use of software affects the number of faults, thus biasing quality metrics and adversely affecting associated decisions. The proprietary nature of usage data has limited deeper exploration of this subject in the past. Objective: To determine how software faults and software use are related and how, based on that, an accurate quality measure can be designed. Method: Via Google Analytics we measure new users, usage intensity, usage frequency, exceptions, and release date and duration for complex proprietary mobile applications for Android and iOS. We utilize Bayesian Network and Random Forest models to explain the interrelationships and to derive the usage-independent release quality measure. To increase external validity, we also investigate the interrelationship among various code complexity measures, usage (downloads), and number of issues for 520 NPM packages. We derived a usage-independent quality measure from these analyses and applied it to 4430 popular NPM packages to construct timelines for comparing the perceived quality (number of issues) and our derived measure of quality during the lifetime of these packages. Results: We found the number of new users to be the primary factor determining the number of exceptions, and found no direct link between the intensity and frequency of software usage and software faults. Crashes grew as a power of the number of new users, with an exponent of 1.02-1.04 for the Android app and 1.6 for the iOS app. Release quality expressed as crashes per user was independent of other usage-related predictors, thus serving as a usage-independent measure of software quality. Usage also affected quality in NPM, where downloads were strongly associated with numbers of issues, even after taking the other code complexity measures into consideration. Unlike in the mobile case, where exceptions per user decrease over time, for 45.8% of the NPM packages the number of issues per download increases. Conclusions: We expect that our results and our proposed quality measure will help gauge the release quality of software more accurately and inspire further research in this area.
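The power-law relationship between crashes and new users reported in this abstract can be illustrated with a least-squares fit on log-transformed values. The sketch below is only an illustration of that one idea: it is not the Bayesian Network or Random Forest analysis the abstract describes, and the function name `fit_power_law` and all numbers are hypothetical.

```python
import math


def fit_power_law(users, crashes):
    """Fit crashes ~ a * users**b by least squares on log-log values.

    Returns the pair (a, b). Inputs must be positive numbers.
    """
    xs = [math.log(u) for u in users]
    ys = [math.log(c) for c in crashes]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                      # slope in log space = exponent
    a = math.exp(mean_y - b * mean_x)  # intercept in log space = log(a)
    return a, b


# Synthetic per-release data generated from crashes = 0.5 * users**1.03.
# These values are made up for the example, not taken from the study.
users = [100, 400, 1600, 6400]
crashes = [0.5 * u ** 1.03 for u in users]
a, b = fit_power_law(users, crashes)
print(f"scale={a:.3f} exponent={b:.3f}")
```

An exponent close to 1 means crashes per new user stay roughly constant as the user base grows, which is the property that makes crashes per user usable as a usage-independent measure.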
A Complete Set of Related Git Repositories Identified via Community Detection Approaches Based on Shared Commits

Mockus, A; Spinellis, D.; Kotti, Z; Dusing, G (June 2020, IEEE International Working Conference on Mining Software Repositories)

To understand the state and evolution of the entirety of open source software, we need to get a handle on the set of distinct software projects. Most open source projects presently utilize Git, a distributed version control system that allows easy creation of clones and results in numerous repositories that are almost entirely based on some parent repository from which they were cloned. Git commits are unlikely to be produced independently and therefore represent a way to group cloned repositories. We use the World of Code infrastructure, containing approximately 2B commits and 100M repositories, to create and share such a map. We discover that the largest group contains almost 14M repositories, most of which are unrelated to each other. As it turns out, developers can push Git objects to an arbitrary repository or pull objects from unrelated repositories, thus linking unrelated repositories. To address this, we apply the Louvain community detection algorithm to this very large graph consisting of links between commits and projects. The approach successfully reduces the size of the megacluster, with the largest group of highly interconnected projects containing under 400K repositories. We expect that the resulting map of related projects, as well as the tools and methods to handle the very large graph, will serve as a reference set for mining software projects and other applications. Further work is needed to determine different types of relationships among projects induced by shared commits and other relationships, for example, by shared source code or similar filenames.
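The first step described in this abstract, grouping repositories that share at least one commit, amounts to finding connected components. The sketch below uses a small union-find structure on toy data; it does not implement the Louvain step, and the function name `group_repos_by_shared_commits`, the repository names, and the commit identifiers are all hypothetical.

```python
from collections import defaultdict


def group_repos_by_shared_commits(repo_commits):
    """Group repositories that share commits, directly or transitively.

    repo_commits maps a repository name to an iterable of commit hashes.
    Returns a list of sets of repository names.
    """
    parent = {repo: repo for repo in repo_commits}

    def find(x):
        # Follow parent links to the root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    first_repo_with = {}  # commit hash -> first repository seen containing it
    for repo, commits in repo_commits.items():
        for commit in commits:
            if commit in first_repo_with:
                union(first_repo_with[commit], repo)
            else:
                first_repo_with[commit] = repo

    groups = defaultdict(set)
    for repo in repo_commits:
        groups[find(repo)].add(repo)
    return list(groups.values())


# Toy data: a project, its fork, and an unrelated library.
clean = {
    "orig/app": {"c1", "c2"},
    "fork/app": {"c1", "c2", "c3"},
    "other/lib": {"c7"},
}
# One repository holding commits pulled from both projects links them all.
polluted = dict(clean, **{"misc/dump": {"c3", "c7"}})

clean_groups = group_repos_by_shared_commits(clean)
polluted_groups = group_repos_by_shared_commits(polluted)
print(len(clean_groups), len(polluted_groups))
```

In the toy data a single repository containing commits from two unrelated projects merges everything into one group, which is the kind of linking that motivates applying community detection on top of plain shared-commit grouping.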
Do code review measures explain the incidence of post-release defects?

https://doi.org/10.1007/s10664-020-09837-4

Krutauz, A.; Dey, T.; Rigby, P.C; Mockus, A (June 2020, Empirical software engineering)

Aim: In contrast to studies of defects found during code review, we aim to clarify whether code review measures can explain the prevalence of post-release defects. Method: We replicate McIntosh et al.’s (Empirical Softw. Engg. 21(5): 2146–2189, 2016) study that uses additive regression to model the relationship between defects and code reviews. To increase external validity, we apply the same methodology to a new software project. We discuss our findings with the first author of the original study, McIntosh. We then investigate how to reduce the impact of correlated predictors in the variable selection process and how to increase understanding of the inter-relationships among the predictors by employing Bayesian Network (BN) models. Context: As in the original study, we use the same measures the authors obtained for the Qt project. We mine data from the version control and issue tracker of Google Chrome and operationalize measures that are close analogs to the large collection of code, process, and code review measures used in the replicated study. Results: Both the data from the original study and the Chrome data showed high instability of the influence of code review measures on defects, with the results being highly sensitive to the variable selection procedure. Models without code review predictors had as good or better fit than those with review predictors. The replication, however, agrees with the bulk of prior work showing that prior defects, module size, and authorship have the strongest relationship to post-release defects. The application of BN models helped explain the observed instability by demonstrating that the review-related predictors do not affect post-release defects directly and by revealing indirect effects. For example, changes that have no review discussion tend to be associated with files that have had many prior defects, which in turn increase the number of post-release defects. We hope that similar analyses of other software engineering techniques may also yield a more nuanced view of their impact. Our replication package including our data and scripts is publicly available (Replication package 2018).
Companies' Participation in OSS Development - An Empirical Study of OpenStack

https://doi.org/10.1109/TSE.2019.2946156

Zhang, Y.; Zhou, M.; Mockus, A.; Jin, Z. (May 2020, ICSE 2020)
A Methodology for Analyzing Uptake of Software Technologies Among Developers

https://doi.org/10.1109/tse.2020.2993758

Ma, Y; Mockus, A; Zaretzki, R; Bichescu, B; Bradley, R (May 2020, IEEE transactions on software engineering)

Motivation: The question of what combination of attributes drives the adoption of a particular software technology is critical to developers. It determines both those technologies that receive wide support from the community and those which may be abandoned, thus rendering developers' investments worthless. Aim and Context: We model software technology adoption by developers and provide insights on specific technology attributes that are associated with better visibility among alternative technologies. Thus, our findings have practical value for developers seeking to increase the adoption rate of their products. Approach: We leverage social contagion theory and statistical modeling to identify, define, and empirically test measures that are likely to affect software adoption. More specifically, we leverage a large collection of open source repositories to construct a software dependency chain for a specific set of R language source-code files. We formulate logistic regression models, where developers' software library choices are modeled, to investigate the combination of technological attributes that drive adoption among competing data frame (a core concept for data science languages) implementations in the R language: tidy and data.table. To describe each technology, we quantify key project attributes that might affect adoption (e.g., response times to raised issues, overall deployments, number of open defects, knowledge base) and also characteristics of developers making the selection (performance needs, scale, and their social network). Results: We find that a quick response to raised issues, a larger number of overall deployments, and a larger number of high-score StackExchange questions are associated with higher adoption. Decision makers tend to adopt the technology that is closer to them in the technical dependency network and in author collaboration networks while meeting their performance needs. To gauge the generalizability of the proposed methodology, we investigate the spread of two popular JavaScript web frameworks, Angular and React, and discuss the results. Future work: We hope that our methodology, encompassing social contagion that captures both rational and irrational preferences and the elucidation of key measures from large collections of version control data, provides a general path toward increasing visibility, driving better informed decisions, and producing more sustainable and widely adopted software.
ALFAA: Active Learning Fingerprint based Anti-Aliasing for correcting developer identity errors in version control systems.

https://doi.org/10.1007/s10664-019-09786-7

Amreen, S; Mockus, A; Zaretzki, R; Bogart, C; Zhang, Y (January 2020, Empirical software engineering)

An accurate determination of developer identities is important for software engineering research and practice. Without it, even simple questions such as “how many developers does a project have?” cannot be answered. The commonly used version control data from Git is full of identity errors and the existing approaches to correct these errors are difficult to validate on large scale and cannot be easily improved. We, therefore, aim to develop a scalable, highly accurate, easy to use and easy to improve approach to correct software developer identity errors. We first amalgamate developer identities from version control systems in open source software repositories and investigate the nature and prevalence of these errors, design corrective algorithms, and estimate the impact of the errors on networks inferred from this data. We investigate these questions using a collection of over 1B Git commits with over 23M recorded author identities. By inspecting the author strings that occur most frequently, we group identity errors into categories. We then augment the author strings with three behavioral fingerprints: time-zone frequencies, the set of files modified, and a vector embedding of the commit messages. We create a manually validated set of identities for a subset of OpenStack developers using an active learning approach and use it to fit supervised learning models to predict the identities for the remaining author strings in OpenStack. We then compare these predictions with a competing commercially available effort and a leading research method. Finally, we compare network measures for file-induced author networks based on corrected and raw data. We find commits done from different environments, misspellings, organizational ids, default values, and anonymous IDs to be the major sources of errors. We also find supervised learning methods to reduce errors by several times in comparison to existing research and commercial methods and the active learning approach to be an effective way to create validated datasets. Results also indicate that correction of developer identity has a large impact on the inference of the social network. We believe that our proposed Active Learning Fingerprint Based Anti-Aliasing (ALFAA) approach will expedite research progress in the software engineering domain for applications that involve developer identities.
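One of the three behavioral fingerprints named in this abstract, time-zone frequencies, can be sketched as a normalized histogram of commit UTC offsets per author string. The comparison by cosine similarity below is an assumption made for illustration; the abstract does not state which similarity measure or supervised model is used, and the function names and sample offsets are hypothetical.

```python
import math
from collections import Counter


def timezone_fingerprint(commit_offsets):
    """Return the relative frequency of each UTC offset for one author string."""
    counts = Counter(commit_offsets)
    total = sum(counts.values())
    return {tz: n / total for tz, n in counts.items()}


def cosine_similarity(fp_a, fp_b):
    """Cosine similarity between two sparse frequency vectors stored as dicts."""
    dot = sum(weight * fp_b.get(tz, 0.0) for tz, weight in fp_a.items())
    norm_a = math.sqrt(sum(w * w for w in fp_a.values()))
    norm_b = math.sqrt(sum(w * w for w in fp_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical author strings with made-up commit offsets.
author_a = timezone_fingerprint(["-0500"] * 8 + ["-0400"] * 2)
author_b = timezone_fingerprint(["-0500"] * 3 + ["-0400"] * 1)
author_c = timezone_fingerprint(["+0800"] * 5)

sim_ab = cosine_similarity(author_a, author_b)
sim_ac = cosine_similarity(author_a, author_c)
print(f"a~b: {sim_ab:.3f}  a~c: {sim_ac:.3f}")
```

A high score only suggests that two author strings may belong to the same person; in the approach described above such signals are combined with the other fingerprints and a manually validated training set rather than used alone.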
Companies' Participation in OSS Development - An Empirical Study of OpenStack

https://doi.org/10.1109/TSE.2019.2946156

Zhang, Y; Zhou, M; Mockus, A; Jin, Z (October 2019, IEEE transactions on software engineering)

Commercial participation continues to grow in open source software (OSS) projects and novel arrangements appear to emerge in company-dominated projects and ecosystems. What is the nature of these novel arrangements? Does volunteers' participation remain critical for these ecosystems? Despite extensive research on commercial participation in OSS, the exact nature and extent of company contributions to OSS development, and the impact this engagement may have on the volunteer community, have not been clarified. To bridge the gap, we perform an exploratory study of OpenStack: a large OSS ecosystem with intense commercial participation. We quantify companies' contributions via the developers that they provide and the commits made by those developers. We find that companies made far more contributions than volunteers and that the distribution of the contributions made by different companies is also highly unbalanced. We observe eight unique contribution models based on companies' commercial objectives and characterize each model according to three dimensions: contribution intensity, extent, and focus. Companies providing full cloud solutions tend to make both intensive (more than other companies) and extensive (involving a wider variety of projects) contributions. Usage-oriented companies make extensive but less intense contributions. Companies driven by particular business needs focus their contributions on the specific projects addressing these needs. Minor contributors include community players (e.g., the Linux Foundation) and research groups. A model relating the number of volunteers to the diversity of contribution shows a strong positive association between them.
Detecting and Characterizing Bots that Commit Code

Dey, T; Mousavi, S; Ponce, E; Fry, T; Vasilescu, B; Filippova, A; Mockus, A. (January 2020, International Conference on Mining Software Repositories)
